-
Notifications
You must be signed in to change notification settings - Fork 393
feat: Add read support for Parquet bloom filters #2653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add read support for Parquet bloom filters #2653
Conversation
Fokko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ForeverAngry for working on this. Since this changes the specification, we have to go through an Iceberg improvement proposal to ensure that there is concensus across different implementations.
As part of the change process, my main question would be; what's the added value on top the bloom filters that are embedded in the Parquet files.
pyiceberg/manifest.py
Outdated
| NestedField( | ||
| field_id=146, | ||
| name="bloom_filter_bytes", | ||
| field_type=MapType(key_id=147, key_type=IntegerType(), value_id=148, value_type=BinaryType()), | ||
| required=False, | ||
| doc="Map of column id to bloom filter", | ||
| ), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot just add a field; this requires a spec change: https://iceberg.apache.org/contribute/#apache-iceberg-improvement-proposals
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Fokko take a look now, i changed the spirit of the PR so that it:
- doesn't modify the Iceberg specification
- Doesn't change any existing behavior
Rather, this pr just provides the initial utilities needed to read bloom filters from Parquet files at the file level.
If merged, next steps would be integrating them into the read path.
|
@Fokko I was kinda thinking that when I submitted it. But, it was done, and I figured id just send it to see if it sparked any interest. That's good information though, sometimes I forget about the governance structures that exist for these projects. |
I guess, to me, the main benefit would be the ability to do file-level pruning before opening any files. As a result, this would also come with some secondary benefits like being able to do row group-level pruning within a Parquet file after opening it. |
ee5901f to
d6a123b
Compare
Add utility functions to read and check bloom filters directly from Parquet files using PyArrow, without requiring Iceberg spec changes. - get_parquet_bloom_filter_for_column(): Extract bloom filter from Parquet row group - bloom_filter_might_contain(): Check if value might be in bloom filter This provides foundation for future bloom filter integration without modifying the Iceberg manifest specification.
d6a123b to
55e8516
Compare
This is correct, and it would give some optimization. However, bloom filters need to be tuned based on the cardinallity of the column. The price to pay is that we encode more information in the manifest files.
I would expect Pyarrow to do this automatically 🤔 |
Fair enough, if the "Juice isnt work the squeeze" as they say, then no big deal 😄 |
Closes #2649
Rationale for this change
Add support for using bloom filters in the read path of pyiceberg.
Are these changes tested?
Yes.
Are there any user-facing changes?
I dont think so.